42 research outputs found

    Accelerating Science: A Computing Research Agenda

    Full text link
    The emergence of "big data" offers unprecedented opportunities for not only accelerating scientific advances but also enabling new modes of discovery. Scientific progress in many disciplines is increasingly enabled by our ability to examine natural phenomena through the computational lens, i.e., using algorithmic or information processing abstractions of the underlying processes; and our ability to acquire, share, integrate and analyze disparate types of data. However, there is a huge gap between our ability to acquire, store, and process data and our ability to make effective use of the data to advance discovery. Despite successful automation of routine aspects of data management and analytics, most elements of the scientific process currently require considerable human expertise and effort. Accelerating science to keep pace with the rate of data acquisition and data processing calls for the development of algorithmic or information processing abstractions, coupled with formal methods and tools for modeling and simulation of natural processes as well as major innovations in cognitive tools for scientists, i.e., computational tools that leverage and extend the reach of human intellect, and partner with humans on a broad range of tasks in scientific discovery (e.g., identifying, prioritizing formulating questions, designing, prioritizing and executing experiments designed to answer a chosen question, drawing inferences and evaluating the results, and formulating new questions, in a closed-loop fashion). This calls for concerted research agenda aimed at: Development, analysis, integration, sharing, and simulation of algorithmic or information processing abstractions of natural processes, coupled with formal methods and tools for their analyses and simulation; Innovations in cognitive tools that augment and extend human intellect and partner with humans in all aspects of science.Comment: Computing Community Consortium (CCC) white paper, 17 page

    Characterization of the retinal proteome during rod photoreceptor genesis

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The process of rod photoreceptor genesis, cell fate determination and differentiation is complex and multi-factorial. Previous studies have defined a model of photoreceptor differentiation that relies on intrinsic changes within the presumptive photoreceptor cells as well as changes in surrounding tissue that are extrinsic to the cell. We have used a proteomics approach to identify proteins that are dynamically expressed in the mouse retina during rod genesis and differentiation.</p> <p>Findings</p> <p>A series of six developmental ages from E13 to P5 were used to define changes in retinal protein expression during rod photoreceptor genesis and early differentiation. Retinal proteins were separated by isoelectric focus point and molecular weight. Gels were analyzed for changes in protein spot intensity across developmental time. Protein spots that peaked in expression at E17, P0 and P5 were picked from gels for identification. There were 239 spots that were picked for identification based on their dynamic expression during the developmental period of maximal rod photoreceptor genesis and differentiation. Of the 239 spots, 60 of them were reliably identified and represented a single protein. Ten proteins were represented by multiple spots, suggesting they were post-translationally modified. Of the 42 unique dynamically expressed proteins identified, 16 had been previously reported to be associated with the developing retina.</p> <p>Conclusions</p> <p>Our results represent the first proteomics study of the developing mouse retina that includes prenatal development. We identified 26 dynamically expressed proteins in the developing mouse retina whose expression had not been previously associated with retinal development.</p

    RNABindRPlus: A Predictor that Combines Machine Learning and Sequence Homology-Based Methods to Improve the Reliability of Predicted RNA-Binding Residues in Proteins

    Get PDF
    Protein-RNA interactions are central to essential cellular processes such as protein synthesis and regulation of gene expression and play roles in human infectious and genetic diseases. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases and functional implications of such interactions and for developing effective approaches to rational drug design. Sequence-based computational methods offer a viable, cost-effective way to identify putative RNA-binding residues in RNA-binding proteins. Here we report two novel approaches: (i) HomPRIP, a sequence homology-based method for predicting RNA-binding sites in proteins; (ii) RNABindRPlus, a new method that combines predictions from HomPRIP with those from an optimized Support Vector Machine (SVM) classifier trained on a benchmark dataset of 198 RNA-binding proteins. Although highly reliable, HomPRIP cannot make predictions for the unaligned parts of query proteins and its coverage is limited by the availability of close sequence homologs of the query protein with experimentally determined RNA-binding sites. RNABindRPlus overcomes these limitations. We compared the performance of HomPRIP and RNABindRPlus with that of several state-of-the-art predictors on two test sets, RB44 and RB111. On a subset of proteins for which homologs with experimentally determined interfaces could be reliably identified, HomPRIP outperformed all other methods achieving an MCC of 0.63 on RB44 and 0.83 on RB111. RNABindRPlus was able to predict RNA-binding residues of all proteins in both test sets, achieving an MCC of 0.55 and 0.37, respectively, and outperforming all other methods, including those that make use of structure-derived features of proteins. More importantly, RNABindRPlus outperforms all other methods for any choice of tradeoff between precision and recall. An important advantage of both HomPRIP and RNABindRPlus is that they rely on readily available sequence and sequence-derived features of RNA-binding proteins. A webserver implementation of both methods is freely available at http://einstein.cs.iastate.edu/RNABindRPlus/

    Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing <it>semi-supervised methods</it> for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.</p> <p>Results</p> <p>In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting <it>unlabeled</it> data. We compare semi-supervised AAMMs with: (i) Markov models (MMs) (which do not take advantage of unlabeled data); (ii) an expectation maximization (EM); and (iii) a co-training based approaches to semi-supervised training of MMs (that make use of unlabeled data).</p> <p>Conclusions</p> <p>The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance, and in some cases outperform, the co-training based semi-supervised MMs.</p

    On Evaluating MHC-II Binding Peptide Prediction Methods

    Get PDF
    Choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from MHCPEP, MHCBN, and IEDB databases. The results of our comparison of the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts shows that the former can be rather optimistic relative to the performance of the same methods on similarity-reduced counterparts of the same datasets. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods
    corecore